Abstract

Screening is the problem of estimating a superset of the set of non-zero entries in an unknown
$p$-dimensional vector $\beta^*$ given $n$ noisy observations. In the high-dimensional regime, where
$p > n$, screening algorithms are useful for reducing the dimensionality of $\beta^*$. This allows for
using computationally challenging algorithms, such as cross-validation or stability selection,
for estimating $\beta^*$ and/or the set of non-zero entries in $\beta^*$. We propose a novel framework
for screening, which we refer to as Multiple Grouping (MuG), that groups variables, performs
variable selection over the groups, and repeats this process multiple times to estimate
a sequence of sets that each contain the non-zero entries in $\beta^*$. Screening is done by taking the
intersection of all these estimated sets. We show how MuG, when used in conjunction with
the group Lasso estimator, can consistently perform screening and reduce the dimensionality of
$\beta^*$. The main advantage of using MuG in conjunction with the group Lasso is that it leads to
a parameter-free screening algorithm, so that screening can be done without using any tuning
parameter. Our numerical simulations show the merits of applying the MuG framework
in practice.
1. Introduction

Let $\beta^* \in \mathbb{R}^{p \times 1}$ be an unknown $p$-dimensional vector. Let $y \in \mathbb{R}^{n \times 1}$ be a known
$n$-dimensional vector that captures information about $\beta^*$ using the linear model

$$y = X\beta^* + w, \tag{1}$$

where $X \in \mathbb{R}^{n \times p}$ is a known design matrix and $w$ is measurement noise. Equation (1) is well
studied in the literature owing to its application in many real world problems. For example, in
compressive sensing, it is of interest to measure a signal (image, video, or sound) $\beta^*$ using only
a few measurements using the linear model (1) with a suitable choice of the design matrix $X$
(Candès et al., 2006; Donoho, 2006). Given gene expression data, where typically the number of
observations $n$ is much smaller than the total number of genes $p$, it is of interest to study the
relationships between genes (Wille et al., 2004). These relationships are captured in the vector $\beta^*$
and the observations $y$ and $X$ are known. A similar problem of estimating relationships arises when
modeling economic data (Fan et al., 2011b).



All the problems mentioned above are typically high-dimensional such that $p > n$. This means
that the least-squares estimate is not well defined, since the solution is not unique. An alternative
is to impose structure on $\beta^*$, and a common structure is to assume $\beta^*$ is sparse so that only a few
entries in $\beta^*$ are dominant and the rest are zero. Under the sparsity assumption, one way of
estimating $\beta^*$ is by solving an $\ell_1$-regularized least-squares problem, often referred to as the Lasso
(Tibshirani, 1996):

$$\text{Lasso:}\quad \hat{\beta}^0(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n}\,\|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}. \tag{2}$$
Given an estimate $\hat{\beta}^0(\lambda)$, let $\hat{S}^0(\lambda)$ be the support of $\hat{\beta}^0(\lambda)$ so that it contains all indices that are
non-zero:

$$\hat{S}^0(\lambda) = \{\, j : \hat{\beta}^0_j(\lambda) \neq 0 \,\}. \tag{3}$$

Throughout this paper, we assume $\beta^*$ is $k$-sparse with support $S^*$ such that

$$S^* = \{\, j : \beta^*_j \neq 0 \,\} \quad \text{s.t.} \quad |S^*| = k. \tag{4}$$
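The Lasso in (2) and the support in (3) can be made concrete with a small sketch. The code below is a minimal illustration, assuming a plain proximal-gradient (ISTA) solver written for this note; it is not the solver used in the paper, and the function names `lasso_ista` and `support` are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Approximately minimize (1/(2n)) ||y - X b||_2^2 + lam ||b||_1 by ISTA."""
    n, p = X.shape
    # Step size 1/L, where L = ||X||_2^2 / n is the Lipschitz constant of the gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def support(beta, tol=1e-10):
    """Indices (0-based) of the entries treated as non-zero, as in (3)."""
    return {j for j in range(len(beta)) if abs(beta[j]) > tol}

# Toy check on an orthogonal design, where the Lasso solution is a
# soft-thresholded copy of the true vector.
X_demo = 2.0 * np.eye(4)                      # X^T X / n = I for n = 4
beta_true = np.array([3.0, 0.0, -2.0, 0.0])
y_demo = X_demo @ beta_true
beta_hat = lasso_ista(X_demo, y_demo, lam=0.5)
```

On this toy design the estimate shrinks each non-zero entry toward zero by `lam` and leaves the zeros untouched, so the estimated support equals the true one. For general designs the iteration count and tolerance are assumptions that would need tuning.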



Properties of the solutions $\hat{\beta}^0(\lambda)$ and $\hat{S}^0(\lambda)$ are well studied in the literature; see (Zhao and Yu,
2006; Meinshausen and Bühlmann, 2006; Wainwright, 2009; Meinshausen and Yu, 2009; Negahban et al.,
2010) for some of these results and (Bühlmann and van de Geer, 2011) for an extensive review of
these results. Informally, these results establish conditions under which there exists a $\lambda$ such that
the estimates $\hat{\beta}^0(\lambda)$ and $\hat{S}^0(\lambda)$ are consistent.¹ The choice of $\lambda$ depends on the statistics of the
noise $w$, which is either unknown or difficult to estimate, especially when $p > n$. Alternatively, if
$k$ is known, $\lambda$ can be selected by solving (2) for multiple different values and choosing a solution
$\hat{\beta}^0(\lambda)$ that is $k$-sparse. Unfortunately, $k$ is typically unknown and difficult to estimate accurately.



1.1 Motivation 

It is clear that choosing an appropriate tuning parameter $\lambda$ when solving (2) is non-trivial since
the optimal choice of $\lambda$ depends on the unknown signal $\beta^*$ and the unknown statistics of the
measurement noise $w$. In practice, it is common to use model selection algorithms, such as
cross-validation, stability selection (Meinshausen and Bühlmann, 2010), or other information criterion
based methods for estimating $\lambda$. Both cross-validation and stability selection can be
computationally challenging to implement and require specifying suitable regions where an optimal $\lambda$ may
be found. Although information-criterion based methods, such as the Bayesian information
criterion (BIC), are computationally tractable, they are not suitable for high-dimensional problems
(Meinshausen and Bühlmann, 2010). An extension of BIC, referred to as EBIC (Chen and Chen,
2008), is suitable for high-dimensional problems. However, the performance of EBIC depends on a
suitable choice of a parameter that controls the number of variables selected.

The problem of choosing an algorithmic tuning parameter is not limited to the Lasso. Other
algorithms for estimating $\beta^*$ under the sparsity constraint, such as forward-backward selection
(FoBa) (Zhang, 2011), orthogonal matching pursuit (OMP) (Pati et al., 1993; Davis et al., 1997;
Tropp and Gilbert, 2007), or multi-stage algorithms (Zou, 2006; Bach, 2008; Wasserman and Roeder,
2009; Meinshausen and Bühlmann, 2010; van de Geer et al., 2011), also require specifying a tuning
parameter.

One way of reducing the computational burden of model selection algorithms is to first estimate
a superset of $S^*$. This is known as variable screening or simply screening (Fan and Lv, 2008). In
other words, we want to estimate a superset of $S^*$, say $\bar{S}$, such that $|\bar{S}| \ll p$. Given that $S^* \subseteq \bar{S}$,
we can rewrite the model in (1) as

$$y = X_{\bar{S}}\,\beta^*_{\bar{S}} + w, \tag{5}$$

where $X_{\bar{S}} = (X_s : s \in \bar{S})$ is an $n \times |\bar{S}|$ matrix and $X_i$ is the $i$th column of $X$. Thus, screening
reduces the dimensionality of the problem from $p$ to $|\bar{S}|$. This can lead to significant reductions in
the cost of using model selection algorithms when $|\bar{S}| \ll p$. Current algorithms for screening, which we
review in Section 1.3, require specifying a suitable tuning parameter. Our main contribution in this paper
is to propose a framework for screening that can be used to estimate $\bar{S}$ in a parameter-free way, so
that estimating $\bar{S}$ does not require specifying a tuning parameter.

¹ Consistency of $\hat{\beta}^0(\lambda)$ does not necessarily imply consistency of $\hat{S}^0(\lambda)$. Thus, the $\lambda$ chosen to achieve
consistency of $\hat{\beta}^0(\lambda)$ and $\hat{S}^0(\lambda)$ can be different.
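The reduction from (1) to (5) amounts to keeping only the columns of the design matrix indexed by the screened set. A minimal sketch, assuming 0-based indices and hypothetical helper names `restrict_design` and `embed`:

```python
import numpy as np

def restrict_design(X, s_bar):
    """Return the n x |s_bar| submatrix of X and the sorted column indices used."""
    idx = sorted(s_bar)
    return X[:, idx], idx

def embed(beta_reduced, idx, p):
    """Place an estimate computed on the reduced problem back into a p-vector."""
    beta = np.zeros(p)
    beta[idx] = beta_reduced
    return beta

# Toy example: the true support {1, 3} is contained in the screened set {1, 3},
# so the reduced model reproduces the noiseless observations exactly.
X_demo = np.arange(12.0).reshape(3, 4)
beta_true = np.array([0.0, 2.0, 0.0, -1.0])
y_demo = X_demo @ beta_true
X_red, idx_demo = restrict_design(X_demo, {3, 1})
```

Any downstream estimator can then be run on `X_red` alone, which is where the computational savings come from when the screened set is much smaller than $p$.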

1.2 Main Contributions 

The main contributions in this paper are summarized as follows.

Multiple Grouping (MuG): We propose a general framework for screening that groups variables,
performs variable selection over the groups, and repeats this process multiple times over
different choices of the groupings. The final estimated superset is the intersection of the supports
estimated over each grouping. We refer to our framework as Multiple Grouping (MuG). The main
intuition behind MuG is that if a variable $v$ outside the true support is selected in one iteration, it
may not be selected in another iteration, since the variable may then be grouped with other variables
that are all zero. Figure 1 illustrates MuG using a simple example.

MuG Using Group Lasso: The MuG framework can be applied in conjunction with group based
variable selection algorithms. We study the application of the MuG framework with group Lasso
(Yuan and Lin, 2007), which uses a modification of the Lasso to perform variable selection over
groups.

Parameter Free Screening: Using properties of the Lasso and the group Lasso, we show that
when $p > n$, MuG with group Lasso can perform screening without using a tuning parameter in such
a way that $|\bar{S}| \le n$. Theoretically, we establish conditions under which MuG is high-dimensional
consistent so that $P(S^* \subseteq \bar{S}) \to 1$ as $n, p \to \infty$.



1.3 Related Work on Screening Algorithms 

Fan and Lv (2008) outline a screening algorithm, referred to as sure independence screening (SIS)
or marginal regression (Genovese et al., 2012), that² thresholds $|X^T y|$ to find $\bar{S}$. Extensions of SIS
have been proposed in (Fan et al., 2009; Fan and Song, 2010; Fan et al., 2011a; Ke et al., 2012).
The performance of SIS is sensitive to the choice of the threshold, and an appropriate choice of
the threshold depends on the unknown parameters of the underlying system. The main advantage
of MuG over SIS is that, when $p > n$, screening may be done without using a tuning parameter
or a threshold. Moreover, our numerical simulations show that the MuG framework can
discard more variables when compared to SIS. Since the SIS algorithm is computationally fast, it
may be used in conjunction with the MuG framework by choosing a small threshold to trade off
computational complexity and accuracy.
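The thresholding step of SIS described above can be sketched in a few lines. This is an illustrative version, assuming normalized columns and 0-based indices; the helper `sis_top`, which keeps a fixed number of variables instead of applying a threshold, is a hypothetical convenience and not part of the original description.

```python
import numpy as np

def sis(X, y, t):
    """Keep the indices i whose marginal statistic |X_i^T y| exceeds the threshold t."""
    omega = np.abs(X.T @ y)
    return set(np.flatnonzero(omega > t).tolist())

def sis_top(X, y, size):
    """Keep the `size` indices with the largest marginal statistics |X_i^T y|."""
    omega = np.abs(X.T @ y)
    order = np.argsort(-omega)
    return set(order[:size].tolist())

# Toy example with an identity design, so that |X^T y| = |y|.
X_demo = np.eye(3)
y_demo = np.array([3.0, -0.5, 2.0])
kept = sis(X_demo, y_demo, t=1.0)
```

The output depends directly on `t` (or `size`), which is the tuning-parameter sensitivity discussed in the text.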



Recent works in (Tibshirani et al., 2011; El Ghaoui et al., 2011; Xiang et al., 2011; Xiang and Ramadge,
2012) have analyzed the solutions of the Lasso to derive rules for discarding variables when solving
the Lasso in (2) for a particular $\lambda$. These algorithms perform screening for the estimate $\hat{S}^0(\lambda)$, and
the algorithms extend to the group Lasso estimator. Our work differs from this work since we
perform screening to find a superset of the true support $S^*$. Regardless, when using the Lasso and the
group Lasso with the MuG framework, the algorithms in (Tibshirani et al., 2011; El Ghaoui et al.,
2011; Xiang et al., 2011; Xiang and Ramadge, 2012) can be used to improve the computational
complexity of solving the Lasso and group Lasso problems.

Another approach to parameter free screening is to use properties of variable selection
algorithms such as the Lasso. It is known that the Lasso can select at most $\min\{n, p\}$ variables
(Osborne et al., 2000; Liu and Zhang, 2009). This means that screening can be done by selecting
a $\lambda$ in (2) such that $|\hat{S}^0(\lambda)| = n$. In our proposed algorithm of using MuG with group Lasso
(see Algorithm 2), we use the Lasso to do parameter free screening and then use the group Lasso
multiple times to further screen for variables. We find conditions under which the
group Lasso estimator can remove variables from the Lasso estimator. Moreover, our numerical
simulations show the improvements in using the group Lasso estimator after using the
Lasso. Finally, it is known that cross-validated solutions to the Lasso have the screening property
(Meinshausen and Bühlmann, 2006). Thus, we can use the cross-validated Lasso solutions with the
MuG solutions to find another estimate of the superset of $S^*$.

² Assuming the columns of $X$ are normalized.

Figure 1: An illustration of the Multiple Grouping (MuG) framework. The true support is $S^* =
\{1, 5, 8\}$ and the estimated superset is $\bar{S} = \{1, 5, 7, 8\}$.



1.4 Organization 

The rest of the paper is organized as follows: 

• Section 2 presents the MuG framework with respect to an abstract group based variable
selection algorithm.

• Section 3 shows how MuG can be used with the group Lasso estimator.

• Section 4 outlines conditions under which MuG leads to a screening algorithm that is
high-dimensional consistent.

• Section 5 presents numerical simulations that show the advantages of using MuG in practice
and compares MuG to other screening algorithms.

• Section 6 discusses some extensions of the MuG framework.

• Section 7 summarizes the paper.



2. Multiple Grouping (MuG) for Screening: Overview of Algorithm 

In this Section, we give an overview of the Multiple Grouping (MuG) framework when used in
conjunction with an abstract variable selection algorithm. Let $V = \{1, \ldots, p\}$ index $\beta^*$ defined
in (1). Define a collection of $K$ partitions or groupings of $V$:

$$\mathcal{G}^i = \{G^i_1, \ldots, G^i_{d_i}\}, \quad 1 \le |G^i_j| \le m \ll n, \tag{6}$$

$$\bigcup_{j=1}^{d_i} G^i_j = V, \tag{7}$$

$$G^i_{j_1} \cap G^i_{j_2} = \emptyset, \quad j_1 \neq j_2, \; j_1, j_2 \in \{1, \ldots, d_i\}. \tag{8}$$

We have assumed that each group $G^i_j$ has at least one element and at most $m$ elements, where $m$
is small when compared to $n$ and $p$. Moreover, the groups in a grouping $\mathcal{G}^i$ are chosen such that
they are disjoint and all elements in $V$ are mapped to a group in $\mathcal{G}^i$. Let Alg be a generic variable
selection algorithm:

$$\hat{S}^i = \mathrm{Alg}(y, X, \lambda, \mathcal{G}^i) \tag{9}$$

    $y$ = Observations in (1)            (10)
    $X$ = Design matrix in (1)           (11)
    $\lambda$ = Tuning parameter          (12)
    $\mathcal{G}^i$ = Defined in (6)-(8)  (13)

The set $\hat{S}^i$ is an estimate of the true support $S^*$. We assume that, under certain conditions,
Alg can select all groups $G^i_j$ such that $\beta^*_{G^i_j} \neq 0$. The multiple grouping (MuG) framework for
variable selection is to apply the variable selection algorithm Alg over multiple groupings $\mathcal{G}^i$ to
obtain a sequence of estimates $\{\hat{S}^1, \ldots, \hat{S}^K\}$. The final estimated superset of the support $S^*$ is
the intersection of all the estimates. Algorithm 1 summarizes the MuG framework and Figure 1
illustrates MuG using $K = 2$.

Algorithm 1: Multiple Grouping (MuG)
- Compute $\hat{S}^i$ for $i = 1, \ldots, K$ using (9).
- Return $\bar{S} = \bigcap_{i=1}^{K} \hat{S}^i$.
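Algorithm 1 can be sketched generically as follows. The selection algorithm is passed in as a callable that already holds $y$, $X$, and $\lambda$ and only takes a grouping as input; this interface, the idealized `oracle_alg` selector (which picks every group that touches the true support), and the two groupings below are all hypothetical choices made for illustration. The groupings are not the ones drawn in Figure 1, although they were picked to reproduce the same sets.

```python
from typing import Callable, Iterable, List, Optional, Set

Grouping = List[List[int]]

def mug(alg: Callable[[Grouping], Set[int]], groupings: Iterable[Grouping]) -> Set[int]:
    """Intersect the supports returned by `alg` over all groupings (Algorithm 1)."""
    s_bar: Optional[Set[int]] = None
    for grouping in groupings:
        s_i = set(alg(grouping))
        s_bar = s_i if s_bar is None else (s_bar & s_i)
    return set() if s_bar is None else s_bar

def oracle_alg(s_star: Set[int]) -> Callable[[Grouping], Set[int]]:
    """Idealized selector: returns all variables in groups that intersect s_star."""
    def alg(grouping: Grouping) -> Set[int]:
        return {j for g in grouping if s_star & set(g) for j in g}
    return alg

# K = 2 groupings of V = {1, ..., 8} with true support {1, 5, 8}.
s_star = {1, 5, 8}
g1 = [[1, 2], [3, 4], [5, 6], [7, 8]]
g2 = [[1, 7], [2, 3], [4, 6], [5, 8]]
s_bar = mug(oracle_alg(s_star), [g1, g2])
```

With the first grouping alone the selector returns six variables; intersecting with the second grouping removes variables 2 and 6, which were only selected because they shared a group with a true variable.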



Typical applications of group based variable selection algorithms assume that it is known a
priori which groups of variables in $\beta^*$ are non-zero or zero. Our setting is different since we assume
that $\beta^*$ is sparse (and not necessarily group sparse) and group variables to estimate a superset of
the true support. Since the groupings can be chosen arbitrarily, we repeat this process multiple
times using different groupings and take an intersection over all the estimates to find the
final estimated superset of the true support. Note that the MuG framework can be easily extended
to cases where $\beta^*$ is known to be group sparse; see Section 6 for more details. The choice of the
various parameters, namely $\lambda$, $K$, and $\mathcal{G}^i$, is briefly discussed next.

Choosing $\lambda$: The parameter $\lambda$ controls the number of variables selected in each iteration of the
MuG framework. We want to choose $\lambda$ so that all variables in $S^*$ are included in each $\hat{S}^i$ with high
probability while $\hat{S}^i$ is as small as possible. This will ensure that $S^* \subseteq \bar{S}$ with high probability. One
way of doing this is by carefully choosing $\lambda$ using some model selection algorithm, such as
cross-validation. However, this can be computationally challenging and probably no different than using
cross-validation in the original variable selection algorithm without any groups. In our proposed
approach for parameter free screening, we make use of the property that solutions to Lasso based
algorithms select at most $n$ variables. When selecting groups, this means that at most $n$ groups
will be selected. This allows for choosing $\lambda$ without using any model selection algorithm.

Choosing $K$: The parameter $K$ controls the number of groupings we form in the MuG framework.
It is clear that $|\bar{S}|$ decreases or remains the same as $K$ increases. However, we do not want $K$ to
be too large since, with small probability, there may exist a grouping for which we may discard
an element in $S^*$. On the other hand, choosing $K$ to be too small may not result in significant
reduction in dimensionality. We show that when using Lasso based variable selection algorithms,
choosing $K$ such that $K/(p/m)^2 \to 0$ is sufficient to ensure consistency of the screening algorithm.

Choosing $\mathcal{G}^i$: We discuss two methods for choosing $\mathcal{G}^i$. The first method chooses $\mathcal{G}^i$ by randomly
partitioning the set of indices $V$. The second method chooses $\mathcal{G}^{i+1}$ using the estimates $\hat{S}^1, \ldots, \hat{S}^i$.
Our numerical simulations compare both these methods and also discuss the trade-offs in choosing
$m$, i.e., the maximum size of the groups in $\mathcal{G}^i$.



3. MuG Using Group Lasso 

So far, we have given an overview of the Multiple Grouping (MuG) framework using an abstract
variable selection algorithm. In this Section, we show how MuG can be used with the group Lasso
estimator, which was proposed in (Yuan and Lin, 2007) as an extension to the Lasso for variable
selection and prediction given prior knowledge about groups of variables that are either zero or
non-zero. Section 3.1 outlines the MuG framework using the group Lasso. Section 3.2 analyzes the
MuG framework and gives some insight as to why the MuG framework may successfully perform
screening. Section 3.3 presents an algorithm for grouping variables and empirically evaluates the
algorithm using a simple numerical example.



3.1 Main Algorithm 

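Let $\mathcal{G}^i$ be a grouping defined in (6)-(8) with $d_i$ groups. The weighted $(1, \nu)$-norm, using the
grouping $\mathcal{G}^i$, is defined as follows:

$$\|\beta\|_{\mathcal{G}^i, \nu} = \sum_{j=1}^{d_i} \sqrt{m_{ij}}\; \|\beta_{G^i_j}\|_{\nu}, \tag{14}$$

where $m_{ij} = |G^i_j|$. The group Lasso, first proposed in (Yuan and Lin, 2007), solves the following
optimization problem:

$$\text{Group Lasso:}\quad \hat{\beta}^i(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{2n}\,\|y - X\beta\|_2^2 + \lambda \|\beta\|_{\mathcal{G}^i, 2} \right\}. \tag{15}$$

When $G^i_j = \{j\}$, the group Lasso reduces to the Lasso in (2). In the literature, the group Lasso is
also referred to as $\ell_1/\ell_2$-regularized least squares, or $\ell_1/\ell_\nu$-regularized least squares when $\|\beta\|_{\mathcal{G}^i, 2}$ is
replaced by $\|\beta\|_{\mathcal{G}^i, \nu}$. Let $\hat{S}^i(\lambda)$ be the support of $\hat{\beta}^i(\lambda)$. Let the support over the groups be $\hat{S}_{\mathcal{G}^i}(\lambda)$
such that

$$\hat{S}_{\mathcal{G}^i}(\lambda) = \{\, G^i_j : \hat{\beta}^i_{G^i_j}(\lambda) \neq 0 \,\}. \tag{16}$$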
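The definitions in (14)-(16) can be illustrated with a short sketch for $\nu = 2$. The solver below is a basic proximal-gradient scheme with groupwise soft-thresholding, chosen here only because it is short; it is an assumption of this note rather than the implementation used in the paper, and the names `group_norm`, `group_lasso_ista`, and `selected_groups` are hypothetical. Groups are lists of 0-based indices.

```python
import numpy as np

def group_norm(beta, groups):
    """Weighted (1,2)-norm in (14): sum_j sqrt(|G_j|) * ||beta_{G_j}||_2."""
    return sum(np.sqrt(len(g)) * np.linalg.norm(beta[g]) for g in groups)

def group_lasso_ista(X, y, lam, groups, n_iter=500):
    """Approximately solve (15) by proximal gradient with groupwise shrinkage."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)   # 1/L for the least-squares term
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n
        for g in groups:
            thr = step * lam * np.sqrt(len(g))
            nz = np.linalg.norm(z[g])
            # Shrink the whole block toward zero, or zero it out entirely.
            z[g] = 0.0 if nz <= thr else (1.0 - thr / nz) * z[g]
        beta = z
    return beta

def selected_groups(beta, groups, tol=1e-10):
    """Groups with a non-zero block, i.e. the group support in (16)."""
    return [g for g in groups if np.linalg.norm(beta[g]) > tol]

# Toy check on an orthogonal design with two groups of size two.
X_demo = 2.0 * np.eye(4)
beta_true = np.array([3.0, 4.0, 0.0, 0.0])
y_demo = X_demo @ beta_true
pairs = [[0, 1], [2, 3]]
beta_hat = group_lasso_ista(X_demo, y_demo, lam=1.0, groups=pairs)
```

With singleton groups the weights are all one and the blockwise shrinkage becomes ordinary soft-thresholding, which matches the statement that the group Lasso reduces to the Lasso in (2).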
In other words, $\hat{S}_{\mathcal{G}^i}(\lambda)$ is the set of groups in $\mathcal{G}^i$ selected by the group Lasso estimator. The
following Lemma, proved in (Liu and Zhang, 2009), characterizes the cardinality of $\hat{S}_{\mathcal{G}^i}(\lambda)$.

Lemma 3.1 (Liu and Zhang (2009)). For all $\lambda > 0$, $|\hat{S}_{\mathcal{G}^i}(\lambda)| \le \min\{n, d_i\}$, where $d_i$ is the number
of groups in the grouping $\mathcal{G}^i$. Moreover, if $G^i_1, \ldots, G^i_{\hat{d}}$ are the groups selected by the group Lasso
estimator $\hat{\beta}^i(\lambda)$ in (15), then the vectors

$$X_{G^i_j}\, \hat{\beta}^i_{G^i_j}(\lambda), \quad j = 1, \ldots, \hat{d}, \tag{17}$$

are linearly independent.

Using Lemma 3.1, we see that the Lasso can select at most $\min\{n, p\}$ variables and the group
Lasso can select at most $\min\{n, d_i\}$ groups of variables. The second part of Lemma 3.1 identifies a
relationship between the estimates and the columns of $X$. The exact maximum number of groups
or variables selected depends on the columns of $X$. For example, when solving the Lasso, if $n - 1$
columns are linearly independent and the rest of the columns are linearly dependent on one of the
other columns, then the Lasso can select at most $n - 1$ variables.



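Algorithm 2: MuG using Group Lasso

- Assume $p > n$.

- Solve (2) and choose³ $\lambda_0$ s.t. $|\hat{S}^0(\lambda_0)| = n$.

- Initialize $\bar{S} = \hat{S}^0(\lambda_0)$.

- For $i = 1, \ldots, K$

  - Choose a grouping $\mathcal{G}^i$ that satisfies (6)-(8) and $d_i > n$.

  - Solve (15) using $\mathcal{G}^i$ and choose³ $\lambda_i$ s.t. $|\hat{S}_{\mathcal{G}^i}(\lambda_i)| = n$.

  - Let $\hat{S}^i(\lambda_i)$ be the support of the group Lasso estimator and update $\bar{S}$:

    $\bar{S} = \bar{S} \cap \hat{S}^i(\lambda_i)$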
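The steps of Algorithm 2 can be strung together in a compact sketch. Everything below is an assumption made for illustration: the proximal-gradient solver, the finite grid `lams` used in place of an exact solution path, the rule of taking the smallest count at or above $n$ (falling back to the largest count otherwise), and random groupings of size `m`. Indices are 0-based and the names `group_lasso`, `select`, and `mug_group_lasso` are hypothetical.

```python
import numpy as np

def group_lasso(X, y, lam, groups, n_iter=200):
    """Proximal-gradient sketch for (15); singleton groups give the Lasso in (2)."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n
        for g in groups:
            thr = step * lam * np.sqrt(len(g))
            nz = np.linalg.norm(z[g])
            z[g] = 0.0 if nz <= thr else (1.0 - thr / nz) * z[g]
        beta = z
    return beta

def select(X, y, groups, n_target, lams, tol=1e-8):
    """Support for the grid value whose number of selected groups is the smallest
    count >= n_target, or the largest count if no grid value reaches n_target."""
    best = None
    for lam in lams:
        beta = group_lasso(X, y, lam, groups)
        count = sum(1 for g in groups if np.linalg.norm(beta[g]) > tol)
        key = (count < n_target, count if count >= n_target else -count)
        if best is None or key < best[0]:
            supp = set(np.flatnonzero(np.abs(beta) > tol).tolist())
            best = (key, supp)
    return best[1]

def mug_group_lasso(X, y, K, m, lams, rng):
    """Lasso screening followed by K group Lasso passes over random groupings."""
    n, p = X.shape
    s_bar = select(X, y, [[j] for j in range(p)], n, lams)   # Lasso step
    for _ in range(K):
        idx = rng.permutation(p).tolist()
        groups = [idx[i:i + m] for i in range(0, p, m)]      # random partition
        s_bar &= select(X, y, groups, n, lams)
    return s_bar
```

A call such as `mug_group_lasso(X, y, K=50, m=2, lams=np.geomspace(1.0, 0.01, 20), rng=np.random.default_rng(0))` would mimic the setting of Section 3.3, although results from this simplified solver need not match the reported figures. By construction the output is always a subset of the Lasso support computed in the first step.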



When $p > n$, we can easily perform screening by solving the Lasso problem in (2) to select
at most $n$ variables. Using the MuG framework, we may further reduce the dimensionality of the
problem. Algorithm 2 outlines the MuG framework when used in conjunction with the group Lasso
estimator in (15). We first solve the Lasso by choosing a $\lambda$ that selects at most $n$ variables. If
$n$ variables cannot be selected, we select the maximum number of variables the Lasso can select.
Next, we solve the group Lasso for multiple different choices of the groupings in such a way that
at most $n$ groups are selected. Again, if $n$ groups cannot be selected, we choose the maximum
number of groups possible. The final step is to take an intersection over all the supports to find an
estimate $\bar{S}$. Some remarks about Algorithm 2 are as follows.

³ If $n$ variables cannot be selected, the $\lambda_0$ chosen will be $\lambda_0 = \arg\max_\lambda |\hat{S}^0(\lambda)|$. Similarly, for the group Lasso, if
$n$ groups cannot be selected, the $\lambda_i$ chosen will be $\lambda_i = \arg\max_\lambda |\hat{S}_{\mathcal{G}^i}(\lambda)|$.
1. Practical Implementation. When using standard implementations of the Lasso and the
group Lasso, it may not be computationally feasible for all solutions of the Lasso to have
support of size less than or equal to $n$. Thus, in practice, we apply the Lasso for multiple
different values of $\lambda$ and choose a $\lambda$ for which the estimated support is the smallest above
$n - 1$. A similar step is done for the group Lasso solution.

2. Why MuG? It is clear that $|\bar{S}| \le |\hat{S}^0(\lambda_0)|$. The main advantage of using MuG is when
$|\bar{S}| < |\hat{S}^0(\lambda_0)|$. This will happen if the group Lasso discards some of the variables estimated
by the Lasso. Our numerical simulations in Section 5 show that this is indeed the case.

3.2 Insight into the MuG Framework 

In this Section, we identify a sufficient condition on the matrix $X$ such that $|\bar{S}| < |\hat{S}^0(\lambda_0)|$. For
simplicity, we apply MuG with $K = 1$ so that

$$\bar{S} = \hat{S}^0(\lambda_0) \cap \hat{S}^1(\lambda_1), \tag{18}$$

where $\hat{S}^0(\lambda_0)$ is the support estimated using Lasso and $\hat{S}^1(\lambda_1)$ is the support estimated using group
Lasso for a grouping $\mathcal{G}^1$. We want to understand the conditions under which $\bar{S}$ will discard at least
one element from $\hat{S}^0(\lambda_0)$. Consider the following assumptions.

(P1) The set $\hat{S}^0(\lambda_0)$ contains the first $n$ elements, i.e., $\hat{S}^0(\lambda_0) = \{1, 2, \ldots, n\}$.

(P2) Let $\mathcal{G}^1$ be defined as follows:

$$G^1_j = \{j\}, \quad \text{for } j = 1, \ldots, n-1 \tag{19}$$

$$G^1_n = \{n, n+1\} \tag{20}$$

$$G^1_j = \{j+1\}, \quad \text{for } j = n+1, \ldots, p-1 \tag{21}$$

(P3) If the group $G_j$ is selected, then $\min_{k \in G_j} |\hat{\beta}_k| > 0$.

Assumptions (P1) and (P2) simplify notation. Assumption (P3) says that if the group Lasso selects
a group, then all estimated values in that group are non-zero. Given (P1)-(P3), we want to find
sufficient conditions under which the group $G^1_n$ will not be selected by the group Lasso estimator.

Theorem 3.1. Given (P1)-(P3), if the vectors $X_1, X_2, \ldots, X_{n-1}, (X_n + \gamma X_{n+1})$ are linearly dependent
for all $\gamma > 0$, then $|\bar{S}| < |\hat{S}^0(\lambda_0)|$.

Proof. Suppose $X_1, X_2, \ldots, X_{n-1}, (X_n + \gamma X_{n+1})$ are linearly dependent for all $\gamma > 0$ and $|\bar{S}| = |\hat{S}^0(\lambda_0)|$.
This means that $\hat{S}^0(\lambda_0) \subseteq \hat{S}^1(\lambda_1)$, which implies that the groups $G^1_1, \ldots, G^1_n$ are selected by the
group Lasso estimator. Using Lemma 3.1 and (P3), we have that

$$X_1, X_2, \ldots, X_{n-1}, (X_n + \alpha X_{n+1})$$

are linearly independent for some $\alpha > 0$. This leads to a contradiction. □

It is easy to generalize Theorem 3.1 when arbitrary groupings are selected instead of the
specialized grouping defined in (19)-(21). The sufficient condition in Theorem 3.1 is fairly strong and
we conjecture that the condition can be weakened significantly. This will be clear when we present
numerical simulations. Regardless, Theorem 3.1 gives insight as to why MuG can remove some of
the variables selected by the Lasso. Informally, if $X_{n+1}$ is chosen such that it strongly depends on
some of the columns selected by the Lasso, the group Lasso may not select the group containing
$\{n, n+1\}$ since there may exist another variable $n'$ such that $X_1, X_2, \ldots, X_{n'}$ leads to a better
approximation.

Figure 2: Illustration of the adaptive grouping algorithm where we either group one variable from
$\bar{S}$ with variables from $\bar{S}^c$ or group variables in $\bar{S}^c$ together.

3.3 Choosing the Groupings 

This Section addresses the problem of choosing the groupings $\mathcal{G}^i$ in the MuG framework. We
consider two such methods.

(a) Random Groupings: Partition the index set randomly such that each group in $\mathcal{G}^i$ has at most
$m$ elements.

(b) Adaptive Groupings: Let $\bar{S}$ be the current estimate after running the MuG framework $i - 1$
times. To construct the grouping $\mathcal{G}^i$, randomly group an element in $\bar{S}$ with at most $m - 1$
elements from $\bar{S}^c$. Once all elements in $\bar{S}$ have been grouped, randomly group the remaining
elements in groups of size at most $m$. Figure 2 illustrates this adaptive construction.

The adaptive grouping algorithm is motivated by Theorem 3.1. In particular, suppose
$X_1, \ldots, X_n$ are selected by the Lasso. From Lemma 3.1, we know that these vectors must be
linearly independent. This implies that the vectors $X_1, X_2, \ldots, X_{n-2}, (X_{n-1} + \gamma X_n)$ are linearly
independent for all $\gamma > 0$. Thus, if $n - 1$ and $n$ are grouped together, there is a high chance that the
estimated support will always contain $\{1, \ldots, n\}$.
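The two grouping schemes (a) and (b) can be sketched as follows. This is one possible reading of the descriptions above, assuming 0-based indices and Python's standard `random.Random` as the source of randomness; the function names are hypothetical.

```python
import random

def random_grouping(p, m, rng):
    """(a) Randomly partition {0, ..., p-1} into groups of size at most m."""
    idx = list(range(p))
    rng.shuffle(idx)
    return [idx[i:i + m] for i in range(0, p, m)]

def adaptive_grouping(p, m, s_bar, rng):
    """(b) Pair each element of s_bar with up to m-1 elements outside s_bar,
    then randomly group whatever is left in groups of size at most m."""
    inside = sorted(s_bar)
    rng.shuffle(inside)
    outside = [j for j in range(p) if j not in s_bar]
    rng.shuffle(outside)
    groups = []
    for j in inside:
        take, outside = outside[:m - 1], outside[m - 1:]
        groups.append([j] + take)       # exactly one element of s_bar per group
    groups += [outside[i:i + m] for i in range(0, len(outside), m)]
    return groups

rng_demo = random.Random(0)
g_rand = random_grouping(10, 3, rng_demo)
g_adapt = adaptive_grouping(10, 3, {0, 4, 7}, rng_demo)
```

In the adaptive scheme no two elements of the current estimate share a group, so a variable in the estimate survives the next pass only if its own group is selected.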

To compare the performance of the two grouping algorithms, consider the linear model in (1)
with the following parameters:

• $p = 100$, $n = 30$, $k = 5$, and $\sigma = 1.0$.

• All non-zero elements in $\beta^*$ have magnitude 1.0.

• Each entry in $X$ is sampled independently from a standard normal distribution.

Note that the MuG framework is sensitive to the choice of the $K$ groupings. A different choice of
the groupings may result in a different output $\bar{S}$. To study the properties of $\bar{S}$, we fix $X$ and $w$
and apply MuG 200 times over different choices of the set of $K$ groupings for both random and
adaptive groupings. Figure 3 shows the histogram of the cardinality of $\bar{S}$.

From the histogram, it is clear that the adaptive groupings approach results in estimates that
have lower cardinality than that of the random groupings approach. For all the 400 instances (200
each for random and adaptive groupings), $\bar{S}$ contained the true support $S^*$. Furthermore, the
Lasso always selected 30 variables and the histogram shows that MuG discarded on average
about half the variables estimated by the Lasso. We refer to Section 5 for additional numerical
simulations showing the advantages of the MuG framework.

Figure 3: Histogram of $|\bar{S}|$ when applying MuG 200 times over different choices of the set of
groupings. The parameters are $p = 100$, $n = 30$, $k = 5$, $\sigma = 1$, $m = 2$, and $K = 50$. In (a),
we choose the groupings by randomly partitioning the index set in groups of size $m$. In (b), we
adaptively choose the groupings in each iteration of the MuG framework as described in Figure 2.



4. Theoretical Analysis: High-Dimensional Consistency 



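In this Section, we find conditions under which the MuG framework in Algorithm 2 is
high-dimensional consistent so that $P(S^* \subseteq \bar{S}) \to 1$ as $n, p \to \infty$. For a grouping $\mathcal{G}^i$, let $S^*_{\mathcal{G}^i}$ be
the set of groups that contain at least one non-zero element. Following (Negahban et al., 2010),
define the set $C(S^*_{\mathcal{G}^i})$ such that

$$C(S^*_{\mathcal{G}^i}) = \Big\{ \theta \in \mathbb{R}^p : \sum_{G \notin S^*_{\mathcal{G}^i}} \|\theta_G\|_2 \le 3 \sum_{G \in S^*_{\mathcal{G}^i}} \|\theta_G\|_2 \Big\}. \tag{22}$$

For a suitable choice of the regularization parameter, Negahban et al. (2010) show that the error
$(\hat{\beta}^i(\lambda) - \beta^*) \in C(S^*_{\mathcal{G}^i})$. Recall the model in (1) where $X$ and $y$ are known, $\beta^*$ is unknown, $k$ is
the number of non-zero entries in $\beta^*$, and $w$ is the measurement noise. Consider the following
assumptions on $\mathcal{G}^i$, $X$, $w$, $k$, and $\beta^*$:

$$k \le \min\{n, p\} \tag{23}$$

$$|G^i_j| = m, \quad i = 1, \ldots, K, \; j = 1, 2, \ldots, d_i \tag{24}$$

$$\frac{\|X_j\|_2}{\sqrt{n}} \le 1, \quad \forall j = 1, 2, \ldots, p \tag{25}$$

$$w \sim \mathcal{N}(0, \sigma^2 I_{n \times n}) \tag{26}$$

$$\frac{\|X\theta\|_2^2}{n} \ge \tau \|\theta\|_2^2, \quad \forall\, \theta \in C(S^*_{\mathcal{G}^i}), \; i = 0, \ldots, K \tag{27}$$

$$\beta_{\min} = \min_{j \in S^*} |\beta^*_j| > \frac{8\sigma\sqrt{k}}{\tau} \times [\,\ldots\,] \tag{28}$$

$$\frac{K}{(p/m)^2} \to 0 \tag{29}$$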



The parameters $\tau$ and $\sigma$ are constants. We assume that the parameter $p$ scales with $n$ such
that $p \to \infty$ as $n \to \infty$. Further, $K$, $k$, and $m$ are also allowed to scale with $n$.

Theorem 4.1 (Consistency of Screening). If (23)-(29) hold, then $|\bar{S}| \le n$ and $P(S^* \subseteq \bar{S}) \to 1$ as
$n, p \to \infty$, where $\bar{S}$ is computed using Algorithm 2 and the probability is with respect to the random
noise $w$.

Proof. See the Appendix. □



Remarks: 



1. We discuss the assumptions in (23)-(26). Equation (23) ensures that $k$ or more
variables are selected in each iteration of the MuG framework. For notational simplicity, all
the groups in $\mathcal{G}^i$ are chosen to be of the same size in (24). Similar results can be obtained
by appropriately scaling the design matrix $X$ when considering different group sizes. The
normalization in (25) is a standard assumption and the upper bound of one can be replaced
by a constant, which will change the subsequent assumptions. The Gaussian assumption in
(26) can be generalized so that $w$ has sub-Gaussian tails (Negahban et al., 2010).



2. Equation (27) on the design matrix $X$ is known as restricted strong convexity (RSC); see
(Negahban et al., 2010) for a detailed description of this assumption. Informally, RSC ensures
that the loss function, $\|y - X\beta\|_2^2$ in our setting, has sufficient curvature around the optimal
solution. Equation (27) is important to prove consistency of the Lasso and the group Lasso
estimator using the methods in (Negahban et al., 2010). We note that alternative conditions
may also be imposed on $X$ following the analysis in (van de Geer and Bühlmann, 2009;
Bickel et al., 2009; Meinshausen and Yu, 2009; Liu and Zhang, 2009). In addition, we refer
to (Negahban et al., 2010) for results on what type of matrices satisfy (27).



3. The assumption in (28) imposes a lower bound on $\beta_{\min}$, the minimum absolute value of the
non-zero entries in $\beta^*$. Informally, a small $\beta_{\min}$ requires more observations for consistent
estimation using Lasso and group Lasso. It is interesting to see how $\beta_{\min}$ scales with the
group size $m$. If we do not use MuG and simply use the Lasso for screening, (28) reduces⁴
to $\beta_{\min} > 8\sigma \times [\,\ldots\,]$. Using MuG with group Lasso increases the lower bound on $\beta_{\min}$ by
$[\,\ldots\,]$. Thus, although the MuG framework may result in screening such that $|\bar{S}| < n$,
this comes at the cost of requiring the minimum absolute value in $\beta^*$ to be larger than that
required when simply using the Lasso for screening ($K = 0$ in Algorithm 2). Stated differently,
using the MuG framework requires an additional $(64\sigma^2 k m)/(\tau^2 \beta_{\min}^2)$ number of observations
for consistent screening when compared to the Lasso. We will further explore this trade-off
using numerical simulations in Section 5.



4. The final assumption in (29) says how many times MuG can be applied for consistent
screening. If $m$ is a constant so that it does not scale with $n$ or $p$, choosing $K = p^{\gamma}$, where $\gamma < 2$,
is sufficient for consistent high-dimensional screening.

⁴ See Lemma A.1 in Appendix A.
5. Numerical Simulations

In this Section, we provide extensive numerical simulations to show the advantages of using MuG
in practice. We assume the linear model in (1) with $\sigma = 0.5$ and consider three different choices of
the $n \times p$ design matrix $X$.

• (IND) Each entry $X_{ij}$ is sampled independently from $\mathcal{N}(0, 1)$. We let $p = 1000$ and $n =
100$, $300$, or $500$ depending on the example considered.

• (TOP) Each row in $X$ is sampled independently from $\mathcal{N}(0, \Sigma)$, where $\Sigma$ is a $p \times p$ covariance
matrix that is Toeplitz such that $\Sigma_{ij} = \mu^{|i-j|}$, where we choose $\mu = -0.4$. We let $p = 1000$
and $n = 100$, $300$, or $500$ depending on the example considered.


• (RL) We use preprocessed data from (Li and Toh, 2010), where $p = 587$ and $n = 148$, such
that each row of $X$ corresponds to gene expression values from $p$ genes relating to Lymph
node status for understanding breast cancer treatment (Pittman et al., 2004).
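The two synthetic designs, (IND) and (TOP), can be generated as in the sketch below. It assumes NumPy's `Generator` API for sampling and a Cholesky factor to impose the Toeplitz covariance; these implementation choices, the column rescaling helper, and the function names are assumptions of this note. The (RL) data set is external and is not reproduced here.

```python
import numpy as np

def design_ind(n, p, rng):
    """(IND): i.i.d. standard normal entries."""
    return rng.standard_normal((n, p))

def toeplitz_cov(p, mu):
    """Covariance with entries mu^{|i-j|}."""
    idx = np.arange(p)
    return mu ** np.abs(idx[:, None] - idx[None, :])

def design_toeplitz(n, p, mu, rng):
    """(TOP): rows drawn independently from N(0, Sigma), Sigma_ij = mu^{|i-j|}."""
    chol = np.linalg.cholesky(toeplitz_cov(p, mu))
    # If z ~ N(0, I), then chol @ z ~ N(0, Sigma); applied row-wise here.
    return rng.standard_normal((n, p)) @ chol.T

def normalize_columns(X):
    """Rescale every column so that ||X_j||_2 / sqrt(n) = 1."""
    n = X.shape[0]
    return X * (np.sqrt(n) / np.linalg.norm(X, axis=0))

rng_demo = np.random.default_rng(0)
X_ind = normalize_columns(design_ind(50, 6, rng_demo))
X_top = normalize_columns(design_toeplitz(50, 6, -0.4, rng_demo))
```

The same calls with `n` in {100, 300, 500} and `p = 1000` would give designs of the sizes used in this Section.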



The matrices in (IND) and (TOP) satisfy the so-called mutual incoherence property (Zhao and Yu,
2006; Meinshausen and Bühlmann, 2006; Wainwright, 2009) such that exact support recovery is
possible using the Lasso given a sufficient number of observations. The matrix in (RL) does not
satisfy mutual incoherence, which means that no matter how many observations are made of $\beta^*$, the
support $S^*$ cannot be estimated exactly using Lasso. Finally, we always normalize the columns of
$X$ such that $\|X_j\|_2/\sqrt{n} = 1$. We evaluate four possible screening algorithms:

• MuG: This is our proposed algorithm outlined in Section [3] (see Algorithmic]). 



• SIS: This is the sure independence screening algorithm proposed in (Fan and Lv, 2008). Given
that the columns are normalized, the algorithm computes $\widehat{S}$ by thresholding $\omega = |X^T y|$ such
that $\widehat{S} = \{i : \omega_i > t\}$. When comparing MuG and SIS, we choose the threshold t so that the
estimates from both SIS and MuG have the same cardinality.

• LCV: This is cross-validated Lasso, where we select $\lambda$ in (2) using cross-validation. We
randomly chose 70% of the data for training and the rest for testing, applied the Lasso
on a grid of values of $\lambda$, and repeated this process 50 times. The final $\lambda$ chosen minimized
the mean negative log-likelihood over the training data. It has been shown theoretically
(Meinshausen and Bühlmann, 2006) and observed empirically that this method may be used
to perform screening. In fact, algorithms such as the adaptive Lasso and the thresholded
Lasso use LCV as the first stage in estimating $\beta^*$ and $S^*$.

• MuG+LCV: This computes the intersection of the MuG and LCV estimates. The main motivation behind
using this method is that, since both LCV and MuG perform screening and the two methods
are different, the intersection of their results can yield an estimate $\widehat{S}$ that
has lower cardinality.
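The SIS comparison and the MuG+LCV intersection described in this list can be sketched as follows. This is an illustration only, not the authors' code: the names `sis_top` and `intersect_screens` are hypothetical, and keeping the `size` largest entries of $|X^T y|$ is used here as an equivalent way (when there are no ties) of choosing the threshold so that the SIS estimate has a prescribed cardinality.

```python
import numpy as np


def sis_top(X, y, size):
    """Keep the `size` variables with the largest |X_j^T y|.

    Assumes the columns of X are already normalized.
    """
    omega = np.abs(X.T @ y)
    order = np.argsort(-omega, kind="stable")  # indices by decreasing omega
    return set(order[:size].tolist())


def intersect_screens(s_a, s_b):
    """Intersection of two screening estimates (e.g. MuG and LCV)."""
    return set(s_a) & set(s_b)
```

To match cardinalities, one would call `sis_top(X, y, size=len(s_mug))`, where `s_mug` stands for whatever set the MuG procedure returned.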

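We evaluate screening algorithms using the false positive rate (FPR) and the false negative rate
(FNR), which are defined as follows:

\[
\mathrm{FPR} = \frac{|\widehat{S} \setminus S^*|}{|\widehat{S}|} \quad \text{and} \quad \mathrm{FNR} = \frac{|S^* \setminus \widehat{S}|}{|S^*|}. \tag{30}
\]
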
FPR reflects the fraction of variables in $\widehat{S}$ that are not in $S^*$, and FNR reflects the fraction of
variables in $S^*$ that are not in $\widehat{S}$. In general, we want both the FPR and FNR to be as small
as possible. Section 5.1 discusses results on applying MuG for different choices of the number of
groupings K. Section 5.2 discusses how the MuG estimates depend on the group size m
and the parameter $\beta_{\min}$.
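As a concrete reading of these two rates, the short sketch below computes them from index sets. The function names `fpr` and `fnr` and the convention of returning 0 for an empty denominator are assumptions made for illustration.

```python
def fpr(s_hat, s_star):
    """Fraction of the estimated set that lies outside the true support."""
    s_hat, s_star = set(s_hat), set(s_star)
    if not s_hat:
        return 0.0  # assumed convention for an empty estimate
    return len(s_hat - s_star) / len(s_hat)


def fnr(s_hat, s_star):
    """Fraction of the true support that is missing from the estimated set."""
    s_hat, s_star = set(s_hat), set(s_star)
    if not s_star:
        return 0.0  # assumed convention for an empty true support
    return len(s_star - s_hat) / len(s_star)
```

For instance, with true support {0, 1, 2, 3, 4} and estimate {0, 1, 2, 3, 7, 8, 9, 10}, four of the eight selected variables are false positives (FPR = 0.5) and one of the five true variables is missed (FNR = 0.2). A screening algorithm aims for an FNR of zero while keeping the FPR as low as possible.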

5.1 Number of Groupings K 

Figures 4 and 5 show results on applying various screening algorithms when X is generated as
described by (IND), (TOP), and (RL) and $\beta_{\min} = 0.5$. The x-axis in all figures is the value of K
and the y-axis is either the FPR or FNR of a screening algorithm. The lines are mean values of
either the FPR or FNR and the shaded regions are the standard deviation of the FPR. For MuG,
we always show the FPR using error bars. The LCV method does not depend on K, which is why its
FPR and FNR do not change with K.

Remarks: 

1. We clearly see that as K increases, the FPR decreases and the FNR either remains constant
or increases at a very small rate. For cases when $p \gg n$, the FPR decreases at a larger rate
and the FNR increases at a small rate. This is because, when $n \ll p$, MuG removes more
variables in each iteration than when n < p.

2. We observe that MuG-based algorithms perform better than simply using cross-validation
(LCV) or using the sure independence screening (SIS) algorithm. The difference between
MuG and SIS is more pronounced in cases where p is much greater than n; see, for example,
Figure 4(a) and Figure 4(d).

3. Combining LCV and MuG, which we refer to as MuG+LCV, leads to a much smaller FPR,
while only increasing the FNR by a small amount. On the other hand, simply using LCV
results in a much larger FPR. For example, in Figure 4(a), LCV has an FPR of 0.8 whereas
MuG+LCV has an FPR of about 0.2.

4. The difference between the performance of MuG and SIS is more pronounced in Figure 5,
where the matrix X corresponds to real measurements of gene expression values. For example,
in Figure 5(a), MuG has an FNR of nearly 0 and SIS has an FNR of nearly 0.8. This means
that for the same cardinality of $\widehat{S}$, the estimate of MuG contains nearly all the true variables,
while SIS is only able to retain 20% of the true variables.



5.2 Size of the Groups m and the Parameter $\beta_{\min}$

In this section, we present numerical simulations to study the performance of MuG as the size of
the groupings m and the parameter $\beta_{\min}$ change. Figure 6(a) shows results for the (IND) example
with p = 1000, n = 100, k = 10, and $\beta_{\min} = 2.0$. We applied MuG using different choices of m
ranging from 2 to 10 and chose K = 100. As m increases, the mean FPR first decreases and then
eventually increases. The mean FNR increases with m. A similar trend is seen in Figure 6(b),
where n = 200. The only difference is that the number of observations is sufficient for screening,
so the FNR is zero as m ranges from 2 to 6. Both these examples show that choosing m to be large
does not necessarily result in superior screening algorithms. In practice, we find that choosing m
as 2, 3, or 4 leads to good screening algorithms.

Figure 4: Results when X is sampled from a Gaussian distribution. See Section 5.1 for more details.

Figure 5: Results when X is the matrix of gene expression values. See Figure 4 for the legend and
Section 5.1 for more details.

Figure 6: Performance of MuG as the size of the groupings m and $\beta_{\min}$ change. See Section 5.2 for
more details.

Figure 6(c) shows results on applying MuG to (IND) and (RL) where we fix all the parameters
and vary $\beta_{\min}$. Only one variable in $\beta^*$ is changed, so it is expected that this particular variable
will be difficult to estimate when $\beta_{\min}$ is small. This is indeed the case in the plot in Figure 6(c).



6. Extensions of MuG 



We presented the MuG framework in the context of the linear regression problem in (1) with
a sparsity constraint on $\beta^*$. We now briefly discuss how the MuG framework can be used in
conjunction with some other variable selection algorithms.

Beyond Lasso: Although we outlined the MuG framework with respect to an abstract variable
selection algorithm in Section 2, we mainly studied the use of MuG in conjunction with the Lasso
and group Lasso estimators. In generalizing MuG to other group-based variable selection algorithms,
such as those in (Yuan and Lin, 2006; Eldar et al., 2010; Baraniuk et al., 2010), it is not
immediately clear if MuG will lead to a parameter-free screening algorithm. This will be a subject
of future research.
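The idea of wrapping MuG around an arbitrary group-based selector can be sketched as below. This is a simplified illustration under several assumptions, not Algorithm 2 from this paper: it uses only random groupings, it omits the Lasso step over singleton groups, and the selector `top_groups` (which keeps a fixed number of groups ranked by $\|X_G^T y\|_2$) is a hypothetical stand-in that does require a tuning parameter, unlike the group Lasso procedure studied here. All function names are invented for the example.

```python
import numpy as np


def random_grouping(p, m, rng):
    """Randomly partition {0, ..., p-1} into groups of size at most m."""
    perm = rng.permutation(p)
    return [perm[i:i + m].tolist() for i in range(0, p, m)]


def mug_screen(X, y, select_groups, m=2, K=10, seed=0):
    """Intersect the variables selected under K random groupings.

    `select_groups(X, y, groups)` may be any group-based selector that
    returns the positions (within `groups`) of the groups it keeps.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    s_hat = set(range(p))
    for _ in range(K):
        groups = random_grouping(p, m, rng)
        kept = select_groups(X, y, groups)
        selected = {j for g in kept for j in groups[g]}
        s_hat &= selected  # a variable survives only if kept every time
    return s_hat


def top_groups(num_keep):
    """Hypothetical stand-in selector: keep the groups with largest ||X_G^T y||_2."""
    def select(X, y, groups):
        scores = [np.linalg.norm(X[:, g].T @ y) for g in groups]
        return np.argsort(scores)[::-1][:num_keep].tolist()
    return select
```

Swapping `top_groups` for another group-based method only requires a function with the same signature; whether the resulting procedure remains parameter-free depends on that method, which is exactly the open question raised in the paragraph this sketch follows.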

Structured Sparsity: For many problems, prior knowledge can be useful in constructing better
estimates of $\beta^*$. For example, if it is known that $\beta^*$ is group sparse, group-based estimators, such
as those in (Yuan and Lin, 2006; Eldar et al., 2010; Baraniuk et al., 2010), can be used to estimate
$\beta^*$ using fewer observations. In this case, MuG can be easily applied by forming groupings
over the known groups. For applications in image processing (Baraniuk et al., 2010; Rao et al.,
2011), it is natural to assume the variables have a tree-structured sparsity pattern, which leads to
forming a set of overlapping groups. Again, the MuG framework can be applied by forming the
groupings over the overlapping groups and using algorithms in (Jacob et al., 2009) for solving the
overlapping group Lasso problem.

Graphical Model Selection: A graphical model is a probability distribution defined on graphs.
The nodes in the graph denote random variables and the edges in the graph denote statistical
relationships amongst random variables (Lauritzen, 1996). The graphical model selection problem
is to estimate the unknown graph given observations drawn from a graphical model. One possible
algorithm for estimating the graph is by solving a Lasso problem at each node in the graph to
estimate the neighbors of each node (Meinshausen and Bühlmann, 2006). Our proposed algorithm
using MuG in Algorithm 2 can be used to estimate a superset of the true edges in the graph. There
are many other algorithms in the literature for learning graphical models. One method, which
is commonly referred to as the graphical Lasso (Banerjee et al., 2008) or gLasso, solves an
$\ell_1$-regularized maximum likelihood problem to estimate a graph. The MuG framework can be applied
to gLasso by placing a group penalty on the inverse covariance. However, in this case, it is not clear
if parameter-free screening can be done. An alternative method is to assume a conservative upper
bound on the number of edges in the graph to perform screening. Our future work will explore this
problem.
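The node-wise Lasso approach mentioned in this paragraph can be illustrated with the following sketch. It is a plain neighborhood selection with a fixed, user-supplied penalty `lam`, not the MuG-based variant: the coordinate descent solver, the function names, and the rule of adding an edge when either of the two regressions selects it are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used in the coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/2n) ||y - X b||_2^2 + lam * ||b||_1.

    Assumes no column of X is identically zero.
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]  # take variable j out of the fit
            rho = X[:, j] @ resid / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            resid -= X[:, j] * beta[j]  # put the updated variable back
    return beta


def neighborhood_selection(data, lam):
    """Estimate edges by a Lasso regression of each node on all the others."""
    d = data.shape[1]
    edges = set()
    for a in range(d):
        others = [b for b in range(d) if b != a]
        beta = lasso_cd(data[:, others], data[:, a], lam)
        for b, coef in zip(others, beta):
            if coef != 0.0:
                edges.add((min(a, b), max(a, b)))  # keep if either side selects
    return edges
```

Applying MuG in this setting would replace each call to `lasso_cd` with a screening step at that node, so that the union of the per-node estimates is a superset of the true edge set.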

Exact Support Recovery: Our primary interest in this paper was screening, i.e., to estimate a
superset of the true support. Exact support recovery can be easily achieved by applying known
algorithms for variable selection once screening has been done. However, it is also of interest to
study if exact support recovery or nearly exact support recovery can be achieved using the MuG
framework. This may require assuming some upper bound on the support of $\beta^*$ and then applying
MuG with this upper bound.



7. Summary 

Accurate high-dimensional variable selection using greedy algorithms requires selecting appropriate
tuning parameters, which must be chosen using computationally challenging methods such as
cross-validation or stability selection. Screening, where a superset of the true variables is estimated,
can reduce the computational burden of model selection by discarding variables that are not in
the true set with high probability. Unfortunately, the performance of current screening algorithms
also depends on choosing an appropriate tuning parameter. In this paper, we presented a novel
framework for screening, which we refer to as Multiple Grouping (MuG), that groups variables,
performs variable selection over the groups, and repeats this process multiple times
using different choices of the groupings. The final superset of the true variables is computed by
taking the intersection of the sets estimated for each grouping. The main advantage of MuG
over other screening algorithms is that MuG can perform screening in the linear regression problem
without using a tuning parameter. Theoretically, we proved consistency of the MuG-based screening
algorithm, and our numerical simulations showed the advantages of using MuG in practice. We also
discussed some other applications of MuG beyond the linear regression problem.
Appendix A. Proof of Theorem 4.1

The following two lemmas, proved in (Negahban et al., 2010), establish consistency of the Lasso
and group Lasso solutions.

Lemma A.1 (Negahban et al. (2010)). Under conditions (25)-(27), there exists a $\lambda$ such that the
solution $\widehat{\beta}(\lambda)$ to the Lasso problem in (2) satisfies

\[
\|\widehat{\beta}(\lambda) - \beta^*\|_2^2 \le \frac{64 \sigma^2 k \log p}{n} \tag{31}
\]

with probability at least $1 - c_1/p^2$, where $c_1$ is a constant.

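Lemma A.2 (Negahban et al. (2010)). Under conditions (24)-(27), there exists a $\lambda$ such that the
solution $\widehat{\beta}^{\mathcal{G}}(\lambda)$ to the group Lasso problem in (5) with the grouping $\mathcal{G}$ satisfies

\[
\|\widehat{\beta}^{\mathcal{G}}(\lambda) - \beta^*\|_2^2 \le 16 \sigma^2 k \left( \frac{m}{n} + \frac{\log p}{n} \right) \tag{32}
\]

with probability at least $1 - c_2/(p/m)^2$, where $c_2$ is a constant.

Using Lemma A.1, there exists a $\lambda$ such that if $k < \min\{n, p\}$ and

\[
\beta_{\min}^2 > \frac{64 \sigma^2 k \log p}{n}, \tag{33}
\]

then

\[
P(S^* \subseteq \widehat{S}(\lambda)) \ge 1 - \frac{c_1}{p^2}. \tag{34}
\]

Similarly, using Lemma A.2, there exists a $\lambda$ such that if $k < \min\{n, p\}$ and

\[
\beta_{\min}^2 > 16 \sigma^2 k \left( \frac{m}{n} + \frac{\log p}{n} \right), \tag{35}
\]

then

\[
P(S^* \subseteq \widehat{S}(\lambda) \,;\, \mathcal{G}) \ge 1 - \frac{c_2}{(p/m)^2}. \tag{36}
\]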

Choosing $\beta_{\min}$ as in (28) ensures that both (33) and (35) are satisfied. Thus, given (23)-(28),
the Lasso and the group Lasso can select all the elements in $S^*$ (and possibly more) with high
probability. Since Algorithm 2 selects all possible variables that the Lasso and the group Lasso can
select, we have

\[
P(S^* \subseteq \widehat{S}(\lambda_i) \,;\, \mathcal{G}^i) \ge 1 - \frac{c_3}{(p/m)^2}, \quad i = 0, 1, \ldots, K, \tag{37}
\]

where $c_3$ is a constant, $\lambda_i$ is the tuning parameter chosen in Algorithm 2, and we let
$\mathcal{G}^0 = \{\{1\}, \ldots, \{p\}\}$. To complete the proof, we have



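\begin{align}
P(S^* \subseteq \widehat{S} \,;\, \{\mathcal{G}^0, \mathcal{G}^1, \ldots, \mathcal{G}^K\})
&= P\Big( \bigcap_{i=0}^{K} \{S^* \subseteq \widehat{S}(\lambda_i)\} \,;\, \{\mathcal{G}^0, \ldots, \mathcal{G}^K\} \Big) \tag{38} \\
&= 1 - P\Big( \bigcup_{i=0}^{K} \{S^* \not\subseteq \widehat{S}(\lambda_i)\} \,;\, \{\mathcal{G}^0, \ldots, \mathcal{G}^K\} \Big) \tag{39} \\
&\ge 1 - \frac{c_3 (K+1)}{(p/m)^2}. \tag{40}
\end{align}

Equation (39) follows from (38) by taking complements, and we use the union bound together with
(37) to go from (39) to (40). Choosing K such that $\lim_{n,p \to \infty} (K+1)/(p/m)^2 = 0$ ensures that
$P(S^* \subseteq \widehat{S} \,;\, \{\mathcal{G}^0, \mathcal{G}^1, \ldots, \mathcal{G}^K\}) \to 1$ as $n, p \to \infty$. Thus, given a set of groupings
$\{\mathcal{G}^0, \mathcal{G}^1, \ldots, \mathcal{G}^K\}$, we
have established consistency of the MuG screening algorithm. If the groupings are chosen randomly,
either using the random grouping or adaptive grouping approaches outlined in Section 3.3, we will
still get the same consistency result since the bound in (40) only depends on the maximum size of
the group m.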
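
To give a sense of scale for the union bound over the K + 1 groupings indexed by i = 0, 1, ..., K, the following worked example plugs in p = 1000, m = 2, and K = 100 (values that appear in Section 5.2) and assumes $c_3 = 1$ purely for illustration, since the proof does not specify the constant:

\[
\sum_{i=0}^{K} \frac{c_3}{(p/m)^2} = \frac{(K+1)\, c_3}{(p/m)^2} = \frac{101}{500^2} = \frac{101}{250000} \approx 4.0 \times 10^{-4}.
\]

Under these assumed values, and provided the conditions of the theorem hold, the probability that the screened set misses an element of $S^*$ would be at most about 0.04%. The bound degrades as m grows, because (p/m)^2 shrinks, which is consistent with the requirement that K grow more slowly than (p/m)^2.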